Jack County
Object-centric Binding in Contrastive Language-Image Pretraining
Assouel, Rim, Astolfi, Pietro, Bordes, Florian, Drozdzal, Michal, Romero-Soriano, Adriana
Recent advances in vision-language models (VLMs) have been driven by contrastive models such as CLIP, which learn to associate visual information with its corresponding text description. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Instead, our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard negatives. To that end, we introduce a binding module that connects a scene graph, derived from a text description, with a slot-structured image representation, facilitating a structured similarity assessment between the two modalities. We also leverage relationships as text-conditioned visual constraints, thereby capturing the intricate interactions between objects and their contextual relationships more effectively. Our resulting model not only enhances the performance of CLIP-based models in multi-object compositional understanding but also paves the way towards more accurate and sample-efficient image-text matching of complex scenes.
- North America > Canada > Quebec > Montreal (0.14)
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Texas > Jack County (0.04)
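The abstract's "structured similarity assessment" between a slot-structured image representation and scene-graph nodes can be illustrated with a minimal sketch. This is not the authors' binding module; it assumes (hypothetically) that each text-derived object node is soft-matched to its most similar image slot via cosine similarity, and the per-node best scores are averaged into one image-text score.

```python
import numpy as np

def structured_similarity(slots, node_embs):
    """Toy structured image-text score: match each text-derived object
    node to its most similar image slot (cosine), then average the
    per-node best scores. A sketch, not the paper's binding module."""
    # L2-normalize both sets of vectors so dot products are cosines.
    slots = slots / np.linalg.norm(slots, axis=1, keepdims=True)
    nodes = node_embs / np.linalg.norm(node_embs, axis=1, keepdims=True)
    sim = nodes @ slots.T          # (num_nodes, num_slots) cosine matrix
    return sim.max(axis=1).mean()  # best slot per node, averaged over nodes

rng = np.random.default_rng(0)
slots = rng.normal(size=(7, 64))   # 7 image slots, 64-d (illustrative sizes)
nodes = rng.normal(size=(3, 64))   # 3 object nodes from the scene graph
score = structured_similarity(slots, nodes)
```

Unlike a single global image-text dot product, this score only rewards an image when every mentioned object finds some slot that matches it, which is the intuition behind slot-level binding.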
Adaptive User-centered Neuro-symbolic Learning for Multimodal Interaction with Autonomous Systems
Recent advances in machine learning, particularly deep learning, have enabled autonomous systems to perceive and comprehend objects and their environments in a perceptual, subsymbolic manner. These systems can now perform object detection, sensor data fusion, and language understanding tasks. However, there is a growing need to enhance these systems to understand objects and their environments more conceptually and symbolically. To achieve this level of artificial intelligence, it is essential to consider both the explicit teaching provided by humans (e.g., describing a situation or explaining how to act) and the implicit teaching obtained by observing human behavior (e.g., through the system's sensors). Thus, the system must be designed with multimodal input and output capabilities to support both implicit and explicit interaction models. In this position paper, we argue for considering both types of input, as well as human-in-the-loop and incremental learning techniques, to advance the field of artificial intelligence and enable autonomous systems to learn like humans. We propose several hypotheses and design guidelines and highlight a use case from related work to illustrate this goal.
- Europe > Germany > Saarland > Saarbrücken (0.04)
- North America > United States > Texas > Jack County (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Education (0.47)
- Health & Medicine (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (0.89)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.81)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
Men Also Do Laundry: Multi-Attribute Bias Amplification
Zhao, Dora, Andrews, Jerone T. A., Xiang, Alice
As computer vision systems become more widely deployed, there is increasing concern from both the research community and the public that these systems are not only reproducing but amplifying harmful social biases. The phenomenon of bias amplification, which is the focus of this work, refers to models amplifying inherent training set biases at test time. Existing metrics measure bias amplification with respect to single annotated attributes (e.g., $\texttt{computer}$). However, several visual datasets consist of images with multiple attribute annotations. We show models can learn to exploit correlations with respect to multiple attributes (e.g., {$\texttt{computer}$, $\texttt{keyboard}$}), which are not accounted for by current metrics. In addition, we show current metrics can give the erroneous impression that minimal or no bias amplification has occurred, as they involve aggregating over positive and negative values. Further, these metrics lack a clear desired value, making them difficult to interpret. To address these shortcomings, we propose a new metric: Multi-Attribute Bias Amplification. We validate our proposed metric through an analysis of gender bias amplification on the COCO and imSitu datasets. Finally, we benchmark bias mitigation methods using our proposed metric, suggesting possible avenues for future bias mitigation.
- North America > United States > Texas > Jack County (0.04)
- North America > United States > New York (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
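Two ideas in the abstract, measuring bias over attribute *sets* rather than single attributes, and avoiding cancellation between positive and negative shifts, can be sketched concretely. The code below is a simplified illustration, not the paper's exact metric: it takes the mean *absolute* change in P(group | attribute set) from training labels to model predictions, over all attribute subsets up to a given size; the data and the `multi_attr_amplification` name are invented for the example.

```python
from itertools import combinations

def bias(pairs, attr_set, group):
    """P(group | all attributes in attr_set present), over (attrs, group) pairs."""
    hits = [g for attrs, g in pairs if attr_set <= attrs]
    return sum(g == group for g in hits) / len(hits) if hits else 0.0

def multi_attr_amplification(train, preds, groups, max_size=2):
    """Mean absolute shift in P(group | attribute set) from training
    data to predictions, over attribute subsets up to max_size.
    Absolute values prevent positive and negative shifts cancelling."""
    attrs = sorted({a for s, _ in train for a in s})
    total, n = 0.0, 0
    for k in range(1, max_size + 1):
        for combo in combinations(attrs, k):
            s = set(combo)
            for g in groups:
                total += abs(bias(preds, s, g) - bias(train, s, g))
                n += 1
    return total / n if n else 0.0

# Tiny synthetic example: attribute sets paired with a binary group label.
train = [({"computer"}, "m"), ({"computer", "keyboard"}, "m"),
         ({"computer"}, "f"), ({"oven"}, "f")]
preds = [({"computer"}, "m"), ({"computer", "keyboard"}, "m"),
         ({"computer"}, "m"), ({"oven"}, "f")]
amp = multi_attr_amplification(train, preds, groups=("m", "f"))
```

A single-attribute metric over `computer` alone would miss shifts that only appear for the pair {`computer`, `keyboard`}, which is exactly the gap the multi-attribute subsets close.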
Towards the Automatic Generation of Conversational Interfaces to Facilitate the Exploration of Tabular Data
Gomez, Marcos, Cabot, Jordi, Clarisó, Robert
Tabular data is the most common format to publish and exchange structured data online. A clear example is the growing number of open data portals published by all types of public administrations. However, exploitation of these data sources is currently limited to technical people able to programmatically manipulate and digest such data. As an alternative, we propose the use of chatbots to offer a conversational interface that facilitates the exploration of tabular data sources. With our approach, any regular citizen can benefit from and leverage these data sources. Moreover, our chatbots are not manually created: instead, they are automatically generated from the data source itself thanks to the instantiation of a configurable collection of conversation patterns.
- North America > United States > Texas > Kleberg County (0.04)
- North America > United States > Texas > Jack County (0.04)
- North America > United States > Texas > Chambers County (0.04)
- Research Report (0.64)
- Workflow (0.46)
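The "instantiation of a configurable collection of conversation patterns" from a data source can be sketched in a few lines. This is a hypothetical illustration, not the authors' generator: it assumes a simple schema of column names and types, and instantiates one filter intent per column plus one aggregate intent per numeric column; all names (`generate_intents`, the intent labels) are invented for the example.

```python
def generate_intents(columns):
    """Instantiate a toy collection of conversation patterns from a
    tabular schema: a filter intent for every column, and an average
    intent for numeric columns. Illustrative only."""
    intents = []
    for name, dtype in columns.items():
        intents.append({
            "intent": f"filter_by_{name}",
            "utterance": f"show rows where {name} is <value>",
        })
        if dtype in ("int", "float"):  # aggregates only make sense numerically
            intents.append({
                "intent": f"average_{name}",
                "utterance": f"what is the average {name}?",
            })
    return intents

schema = {"city": "str", "population": "int"}
intents = generate_intents(schema)
```

Because the patterns are driven entirely by the schema, publishing a new open-data table would regenerate a matching chatbot with no manual authoring, which is the core of the approach described.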
Army looks to block data 'poisoning' in facial recognition, AI - FedScoop
The Army has many data problems. But when it comes to the data that underlies facial recognition, one sticks out: Enemies want to poison the well. Adversaries are becoming more sophisticated at providing "poisoned," or subtly altered, data that will mistrain artificial intelligence and machine learning algorithms. To try to safeguard facial recognition databases from these so-called backdoor attacks, the Army is funding research to build defensive software to mine through its databases. Since deep learning algorithms are only as good as the data they rely on, adversaries can use backdoor attacks to leave the Army with untrustworthy AI or even bake in the ability to kill an algorithm when it sees a particular image, or "trigger." "People tend to modify the input data very slightly so it is not so obvious to a human eye, but can fool the model," said Helen Li, a Duke University faculty member whose research team received $60,000 from the Army Research Office for work on an AI database defensive software.
- North America > United States > Texas > Jack County (0.05)
- North America > United States > New York (0.05)
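The backdoor mechanism the article describes, subtly altered training inputs tied to a "trigger", can be shown with a textbook sketch. This is not the Army study's code: it assumes grayscale images in a NumPy array and (hypothetically) stamps a small bright patch into a fraction of them while relabeling them to the attacker's target class, so a model trained on the result learns the trigger-to-label shortcut.

```python
import numpy as np

def poison(images, labels, target_label, rate=0.1, seed=0):
    """Textbook backdoor poisoning sketch: stamp a 3x3 'trigger' patch
    into a random fraction of images and relabel them to the attacker's
    target class. Returns poisoned copies plus the affected indices."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()  # leave originals intact
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    images[idx, -3:, -3:] = 1.0   # trigger in the bottom-right corner
    labels[idx] = target_label    # mislabel so the model learns the shortcut
    return images, labels, idx

imgs = np.zeros((100, 28, 28))    # 100 blank 28x28 images (illustrative)
labs = np.zeros(100, dtype=int)
p_imgs, p_labs, idx = poison(imgs, labs, target_label=7)
```

The change per image is tiny (9 pixels here), which matches the quote: the alteration is not obvious to a human eye but reliably fools the trained model whenever the trigger appears.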
Generated Loss, Augmented Training, and Multiscale VAE
The variational autoencoder (VAE) framework remains a popular option for training unsupervised generative models, especially for discrete data, where generative adversarial networks (GANs) require workarounds to create gradients for the generator. In our work modeling US postal addresses, we show that our discrete VAE with a tree-recursive architecture demonstrates limited capability in capturing field correlations within structured data, even after overcoming the challenge of posterior collapse with scheduled sampling and tuning of the KL-divergence weight $\beta$. Worse, the VAE seems to have difficulty mapping its generated samples to the latent space, as their VAE loss lags behind or even increases during training. Motivated by this observation, we show that augmenting training data with generated variants (augmented training) and training a VAE with multiple values of $\beta$ simultaneously (multiscale VAE) both improve the generation quality of the VAE. Despite their differences in motivation and emphasis, we show that augmented training and multiscale VAE are actually connected and have similar effects on the model.
- North America > United States > Vermont (0.05)
- North America > United States > Texas > Jack County (0.04)
- North America > United States > California > Santa Clara County > Mountain View (0.04)
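The "multiple values of $\beta$ simultaneously" idea can be made concrete with a minimal sketch of the objective. This is a simplification, not the paper's training procedure: it assumes a diagonal-Gaussian posterior and (hypothetically) averages the standard $\beta$-VAE loss, reconstruction NLL plus $\beta$ times the KL term, over a fixed set of $\beta$ values.

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def multiscale_vae_loss(recon_nll, mu, logvar, betas=(0.1, 1.0, 4.0)):
    """Toy 'multiscale' objective: average the beta-VAE loss
    (reconstruction + beta * KL) over several KL weights at once."""
    kl = kl_diag_gaussian(mu, logvar)
    return float(np.mean([recon_nll + b * kl for b in betas]))

mu = np.zeros(8)      # posterior mean at the prior
logvar = np.zeros(8)  # unit variance, so the KL term is exactly 0
loss = multiscale_vae_loss(recon_nll=2.5, mu=mu, logvar=logvar)
```

Averaging over small and large $\beta$ values pressures the same model to reconstruct sharply and to keep a well-behaved latent space at once, rather than trading one for the other at a single fixed weight.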
nocaps: novel object captioning at scale
Agrawal, Harsh, Desai, Karan, Chen, Xinlei, Jain, Rishabh, Batra, Dhruv, Parikh, Devi, Lee, Stefan, Anderson, Peter
Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to ever function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale benchmark for this task. Dubbed 'nocaps', for novel object captioning at scale, our benchmark consists of 166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets. The associated training data consists of COCO image-caption pairs, plus Open Images image-level labels and object bounding boxes. Since Open Images contains many more classes than COCO, more than 500 object classes seen in test images have no training captions (hence, nocaps). We evaluate several existing approaches to novel object captioning on our challenging benchmark. In automatic evaluations these approaches show modest improvements over a strong baseline trained only on image-caption data. However, even when using ground-truth object detections, the results are significantly weaker than our human baseline - indicating substantial room for improvement.
- Transportation > Ground > Road (1.00)
- Leisure & Entertainment > Sports (1.00)
- Government (0.68)